DECLARATION OF PAUL MCKENNEY 



1. I am an inventor of the patent application having the serial number 
10/679,076, filed on October 4, 2003, having docket number 
BEA920030004US1. 

2. Attached hereto as Exhibit A is the disclosure document regarding the 
invention of the present patent application that I prepared and submitted to the 
IBM Legal Department on January 30, 2003. 

3. The invention of which I am an inventor is described in the "Main Idea" 
section of the disclosure document, as well as in the Word document 
referenced in this section of the disclosure document, which has been 
appended to the disclosure document. 

4. I, along with the other inventor, prepared these disclosure materials prior to 
January 30, 2003. This is proven insofar as page 1 of the disclosure document 
indicates that I submitted this disclosure document on January 30, 2003. 

I hereby declare that the above statements are true, to the best of my recollection. 

Main Idea for Disclosure BEA8-2003-0003 

Prepared for and/or by an IBM Attorney - IBM Confidential 



Archived On 06/07/2003 10:04:42 PM 



Title of disclosure (in English) 

Adaptive Locking Primitives for Computer Architectures Supporting Transactions 
Main Idea 

1. Background: What is the problem solved by your invention? Describe known solutions 
to this problem (if any). What are the drawbacks of such known solutions, or why is an 
additional solution required? Cite any relevant technical documents or references. 

There have been a number of proposals, dating back to Herlihy and Moss's proposal in 1993, 
for transactional memory. Transactional memory can be thought of as a way of bracketing a 
section of code such that it turns into a big, multi-argument LL/SC. The section of code is 
executed speculatively, and the decision to commit the changes is deferred until the end of the 
code segment. If there has been any interference with any of the data used by the critical 
section (e.g., one of the cache lines read by the critical section having been invalidated), the 
transaction is aborted. Moss and Herlihy required that all the important memory operations in 
the critical section be tagged, but more recent proposals within IBM have omitted the tagging, 
making the result more programmer-friendly. However, transactions can still fail due to 
hardware limitations in cache size and associativity, so there needs to be some sort of fallback 
mechanism in order to guarantee forward progress. This invention describes a method of 
augmenting existing locking primitives so that they use transactional mechanisms, but fall back to 
traditional software locking if necessary. This allows current source code to adaptively use 
transactions, gaining their benefits when feasible, but making forward progress via traditional 
software mechanisms when not. Code using these adaptive locking primitives need not change 
at all. 

2. Summary of Invention: Briefly describe the core idea of your invention (saving the 
details for questions #3 below). Describe the advantage(s) of using your invention 
instead of the known solutions described above. 

Maintain an extra bit in the lock word that indicates whether software locking is to be used. 
Define a hardware pseudo-transaction primitive that allows spin_unlock() to determine 
whether a real transaction would have succeeded had it been used. The spin_unlock() primitive 
can then use a number of heuristics to determine when to switch the lock back into transactional 
mode. Other heuristics may be used by spin_lock() in order to determine whether to fall back to 
software locking should the transaction fail. 
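
As a rough illustration of the extra bit described above, the lock word can be read as being in one of three states. The sketch below is not part of the original disclosure: the bit values are taken from the listing in the attached document, while the enum and the function name txn_lock_classify() are hypothetical names introduced here for illustration only.

```c
#include <assert.h>

/* Bit values follow the listing in the attached document. */
#define TXN_LOCK_HELD   0x01  /* lock is currently held via software locking */
#define TXN_LOCK_DOLOCK 0x10  /* software locking is to be used */

/* Hypothetical names for the lock-word states (not from the disclosure). */
enum txn_lock_state {
    TXN_STATE_TRANSACTIONAL, /* DOLOCK clear: critical sections run as transactions */
    TXN_STATE_SW_FREE,       /* DOLOCK set, HELD clear: software mode, lock available */
    TXN_STATE_SW_HELD,       /* DOLOCK and HELD set: software mode, lock held */
    TXN_STATE_UNEXPECTED     /* any other value; the listing never stores one */
};

/* Classify a raw lock-word value. */
static enum txn_lock_state txn_lock_classify(int word)
{
    switch (word) {
    case 0:
        return TXN_STATE_TRANSACTIONAL;
    case TXN_LOCK_DOLOCK:
        return TXN_STATE_SW_FREE;
    case TXN_LOCK_DOLOCK | TXN_LOCK_HELD:
        return TXN_STATE_SW_HELD;
    default:
        return TXN_STATE_UNEXPECTED;
    }
}
```

In this reading, spin_lock() moves the word from the transactional state to a software state when a transaction fails, and spin_unlock() decides whether to move it back.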

3. Description: Describe how your invention works, and how it could be implemented, 
using text, diagrams and flow charts as appropriate. 



AdaptiveTxnLock.2003.01 .30a.d< 



Adaptive Locking Primitives for Computer 
Architectures Supporting Transactions 

There have been a number of proposals based on transactional memory, which either provide the 
transactional semantics directly to the user or do hardware-based transactional execution of locking 
primitives as they are encountered. Transactional memory is normally subject to hardware limitations 
stemming from restricted cache sizes or restricted cache associativity. Transactions that exceed the cache 
size or the associativity limits can never succeed, so there is a need to have a recovery mechanism that 
allows the work to be completed. The approach chosen here uses locking primitives that adaptively 
execute transactionally, but fall back to a software locking protocol should the transaction fail. If 
false-transaction-begin and false-transaction-check primitives are provided, this implementation can 
adaptively revert to a transactional approach should a given critical section's scope be reduced. It is also 
possible for the compiler to estimate the probability of the transactional approach succeeding on a given 
critical section, and for the adaptive algorithm to incorporate this information. 



1 Related Work 

Related work includes: 



1) Transactional memory implementation and use: 

a) Martinez, J. F. and Torrellas, J. "Speculative Synchronization: Applying Thread-Level 
Speculation to Explicitly Parallel Applications", International Conference on Architectural 
Support for Programming Languages and Operating Systems (ASPLOS), 2002. Key concept is 
that there be at least one "safe thread" that executes non-speculatively, guaranteeing forward 
progress. Provides ssu_lock(), ssu_spin(), and ssu_idle() instructions so that software can 
explicitly enable speculation. (PERCS differs in providing transactional interface.) 
http://citeseer.nj.nec.com/^^ 

doczSzasplos02.pdf/martinez02speculative.pdf Similar paper published in 2001 at the Workshop 
on Memory Performance Issues (WMPI), at International Symposium on Computer Architecture 
(ISCA). 

http://citeseer.nj.nec.com/cache/papers/cs/23819/http:zSzzSziacoma.cs.muc.eduzSziac 
paperszSzwmpi locks.pdf/martinez01speculative.pdf 

b) Rajwar, R. and Goodman, J. R. "Transactional Lock-Free Execution of Lock-Based Programs", 
ASPLOS 2002. Uses timestamp-based mechanism to remove locking. Hardware-only proposal 
that speculatively converts locking primitives into transactional execution constructs. PERCS 
differs by having explicit transaction support. 

http://citeseer.nj.nec.com/cache/papers/cs/26669/http:zSzzSzwww.cs.wisc.eduzSz-rajwarzSzpape 
rszSzasplos02.pdf/rajwar02transactional.pdf Similar paper entitled "Speculative Lock Elision: 
Enabling Highly Concurrent Multithreaded Execution" appeared in the 34th International 
Symposium on Microarchitecture, Dec 3-5 2001 Austin, TX. 

http://citeseer.nj.nec.com/cache/papers/cs/25758/http:zSzzSzwww.cs.wisc.eduzSz-rajwarzSzpape 
rszSzmicro01.pdf/rajwar01speculative.pdf 

c) Oplinger, J. and Lam, M. S. "Enhancing Software Reliability with Speculative Threads", 
Proceedings of the 2002 Conference on Architectural Support for Programming Languages and 
Operating Systems. Programmer must supply monitoring functions that validate global state. 
Provides TRY, COMMIT, and ABORT constructs similar to C++ or Java exception blocks. 
http://citeseer.nj.nec.com/cache/papers/cs/26363/http:zSzzSzsuif.stanford.eduzSzpapersz 
O2.pdf/enhancing-software-reliability-with.pdf 

d) http://citeseer.nj.nec.com/cach^ 
SzdetlefszSzpaperszSz00-deque2.pdf/detlefs00even.pdf 

e) Herlihy, M. and Moss, E. "Transactional Memory: Architectural Support for Lock-Free Data 
Structures", Proceedings of the Twentieth Annual International Symposium on Computer 
Architecture, 1993. Provides transactional loads and stores, along with commit, validate, and 
abort instructions. (PERCS differs in providing a transaction begin instruction instead of explicit 
transactional loads and stores. PERCS handles raw code better, and Herlihy/Moss makes more 
efficient use of the hardware. PERCS would seem to have technological trends on its side.) 
http://www.cs.brown.edu/people/mph/H^ 

2) Other hardware locking optimizations: 

a) Kagi, A. "Mechanisms for Efficient Shared-Memory, Lock-Based Synchronization" Ph.D. 
Thesis at U. of Wisconsin-Madison. Defines "VAQUM" primitive that ships data related to a 
given lock to the next CPU that will acquire the lock. Sort of a NUMA-aware lock on steroids. 
PERCS supports transactions. 

http://citeseer.nj.nec.com/cacfe 

kagi thesis.pdf/kagi99mechanisms.pdf 

b) Nakun Seong, Naihoon Jung, Byungho Kim, H. Yoon, and Jung W. Cho. "Intelligent Memory: 
An Architecture for Lock-Free Synchronization". 1996, no clue where published. Ship the 
critical section to the memory module containing the data that the critical section operates on. 

c) J. M. Stone, H. S. Stone, P. Heidelberger, and J. Turek. Multiple reservations and the Oklahoma 
update. IEEE Parallel & Distributed Technology, 1(4):58-71, Nov. 1993. Uses a multi-operand 
variant of LL/SC, if I remember correctly (cannot find the paper online). (PERCS differs in 
providing transaction begin and end instructions.) 

3) Software exploitation of CAS and DCAS: 

a) David L. Detlefs, Christine H. Flood, Alexander T. Garthwaite, Paul A. Martin, Nir N. Shavit, and 
Guy L. Steele, Jr. "Even Better DCAS-Based Concurrent Deques" 2000 International Symposium 
on Distributed Computing. Method of implementing non-blocking concurrent deque with 
memory allocation, using a single DCAS operation per push/pop. PERCS differs by providing 
transactional support. Ignoring further papers based on DCAS, LL/SC, etc. 

4) NUMA-aware locking primitives 

2002, with Benedict Joseph Jackson, Ramakrishnan Rajamony, and Ronald Lynn Rockhold. 
NUMA-aware locking. 

November 2002, with Kevin Closson and Raghu Malige. More NUMA-aware locking. 

5) Queued locking primitives 

Department of Information Science, Faculty of Science, University of Tokyo, 7-3-1, Hongo, 
Bunkyo-ku, Tokyo 113 Japan, 1994 IEEE, pp. 2-6. 

b) "Scalable Spin Locks for Multiprogrammed Systems", Robert W. Wisniewski, et al., Department 
of Computer Science, University of Rochester, Rochester, NY 14627-0226, 1994 IEEE, 
pp. 583-589. 

c) Peter Magnusson, Anders Landin, and Erik Hagersten. Efficient Software Synchronization on 
Large Cache Coherent Multiprocessors. SICS Research Report T94:07, Swedish Institute of 
Computer Science, Box 1263, S-164 28 Kista, Sweden, February 1994. 

d) "Queuing Spin Lock Algorithms to Support Timing Predictability", Travis S. Craig, Department of 
Computer Science and Engineering, FR-35, University of Washington Seattle, WA 98195, 1993 
IEEE, pp. 148-157. 

e) G. Graunke and S. Thakkar. Synchronization Algorithms for Shared Memory Multiprocessors. 
IEEE Computer, 23(6):60-69, 1990. 

f) "The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors", Thomas E. 
Anderson, IEEE Transactions on Parallel and Distributed Systems, vol. 1, No. 1, Jan. 1990, 
pp. 6-16. 

6) Compiler transfer functions to measure the "weight" of a code segment. 

a) Wilson, R. P. and Lam, M. "Efficient Context-Sensitive Pointer Analysis for C Programs", In 
Proceedings of the ACM SIGPLAN'95 Conference on Programming Language Design and 
Implementation, La Jolla, CA, June 18-21, 1995, pp. 1-12. 



2 Memory Transaction Primitives 

The adaptive locking algorithm requires the following memory-transaction primitives: 

• int begin_txn() marks the start of a transaction. It returns TRUE. If a given transaction is 
aborted by the hardware, execution resumes after the corresponding begin_txn(), which 
returns FALSE. 

• void begin_txn_check() marks the start of a pseudo-transaction. A pseudo-transaction 
does not affect the action of the cache system, except to track whether a real transaction would 
have succeeded. The adaptive locking algorithms use this facility to determine when it is safe to 
switch back from software locking to hardware transactions. 

• void commit_txn() marks the end of a transaction. All memory writes that were 
speculatively executed since the matching begin_txn() are now made permanent and visible 
to other processors. This primitive will also silently end the effect of a matching 
begin_txn_check() primitive. One could also construct a commit_txn() that flagged 
this situation as an error, and there may be advantages to this more strict approach. However, if 
such an approach is used, it would be wise to provide a separate primitive that will process the end 
of either a real transaction or a pseudo-transaction in order to simplify software implementations. 

• int end_txn_check() marks the end of a pseudo-transaction. It returns TRUE if a real 
transaction would have succeeded, and FALSE otherwise. 

• void abort_txn(int mimic_hw) aborts the current transaction. If mimic_hw is TRUE, 
then execution resumes with the matching begin_txn() returning FALSE; otherwise, 
execution continues after the abort_txn(). It is illegal to pass TRUE to an abort_txn() 
that matches a begin_txn_check(). It may be useful to have begin_txn_check() (or a 
variant) return a TRUE/FALSE value so that abort_txn(TRUE) could also be used with 
pseudo-transactions, but no useful algorithm exploiting this has been proposed. One can argue 
that a hardware implementation should provide this functionality in case it is needed, but one can 
also argue that this functionality could be efficiently implemented in software. 

More thought is needed for these primitives, but this set suffices to implement the adaptive locking 
algorithm. 
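
The control flow of begin_txn() and abort_txn(TRUE), where an aborted transaction resumes at its begin_txn() with a FALSE result, resembles the standard C setjmp()/longjmp() pair. The sketch below is a single-threaded software stand-in built on that resemblance; it is an illustration only, not the proposed hardware mechanism. The variables shared_value and speculative_value and the helper txn_add() are hypothetical, and real hardware would hold speculative writes in the cache rather than in a shadow variable.

```c
#include <assert.h>
#include <setjmp.h>

#define TRUE  1
#define FALSE 0

/* Hypothetical single-threaded stand-in state (assumption, not from the text). */
static jmp_buf txn_env;        /* where an aborted transaction resumes */
static int shared_value;       /* committed, globally visible value */
static int speculative_value;  /* value as seen inside the transaction */

/* Yields TRUE on entry and FALSE when resumed by abort_txn(TRUE). */
#define begin_txn() (setjmp(txn_env) == 0)

/* Make the speculative write permanent. */
static void commit_txn(void)
{
    shared_value = speculative_value;
}

/* Discard speculative state; optionally resume at begin_txn() as hardware would. */
static void abort_txn(int mimic_hw)
{
    speculative_value = shared_value;
    if (mimic_hw)
        longjmp(txn_env, 1);   /* begin_txn() now evaluates to FALSE */
}

/* Try to add delta transactionally; returns TRUE if the change was committed. */
static int txn_add(int delta, int force_abort)
{
    if (begin_txn()) {
        speculative_value = shared_value + delta;
        if (force_abort)
            abort_txn(TRUE);   /* does not return here */
        commit_txn();
        return TRUE;
    }
    return FALSE;              /* reached only after an abort */
}
```

Calling txn_add(5, FALSE) commits the update, while txn_add(7, TRUE) leaves shared_value untouched and reports failure, which is the behavior the adaptive lock relies on when it decides to fall back to software locking.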



3 Adaptive Locking Algorithm 



typedef atomic_t txn_lock; 
#define TXN_LOCK_HELD   0x01  /* lock is held via software locking */ 
#define TXN_LOCK_DOLOCK 0x10  /* software locking is in use */ 

void spin_lock(txn_lock *tp) 
{ 
    int oldval; 
    int newval; 

    for (;;) { 
        oldval = atomic_read(tp); 
        if (oldval & TXN_LOCK_DOLOCK) { 
            /* Software mode: spin while the lock is held. */ 
            while ((oldval = atomic_read(tp)) == 
                   (TXN_LOCK_DOLOCK | TXN_LOCK_HELD)) { 
                continue; 
            } 
            if (oldval != TXN_LOCK_DOLOCK) { 
                continue;  /* Mode changed, start over. */ 
            } 
            /* Track whether a real transaction would have succeeded. */ 
            begin_txn_check(); 
            newval = oldval | TXN_LOCK_HELD; 
            if (cmpxchg(tp, oldval, newval) != oldval) { 
                abort_txn(FALSE); 
                continue; 
            } 
        } else { 
            if (!begin_txn()) { 
                /* HW aborted the txn, try again. */ 
                oldval = atomic_read(tp); 
                if ((oldval & TXN_LOCK_DOLOCK) == 0) { 
                    /* Can use more sophistication. */ 
                    newval = oldval | TXN_LOCK_DOLOCK; 
                    (void)cmpxchg(tp, oldval, newval); 
                } 
                continue; 
            } 
            /* In a transaction: back out if software locking is in use. */ 
            oldval = atomic_read(tp); 
            if (oldval & TXN_LOCK_DOLOCK) { 
                abort_txn(FALSE); 
                continue; 
            } 
        } 
        break;  /* Lock acquired, or transaction in progress. */ 
    } 
} 



void spin_unlock(txn_lock *tp) 
{ 
    int newval; 
    int nextval; 
    int oldval; 

    if ((oldval = atomic_read(tp)) != (TXN_LOCK_DOLOCK | TXN_LOCK_HELD)) { 
        /* Lock was not acquired in software, so end the transaction. */ 
        commit_txn(); 
    } else { 
        /* Can use more sophistication. */ 
        if (end_txn_check()) { 
            newval = 0;                /* Switch back to transactions. */ 
        } else { 
            newval = TXN_LOCK_DOLOCK;  /* Stay with software locking. */ 
        } 
        while ((nextval = cmpxchg(tp, oldval, newval)) != oldval) { 
            oldval = nextval; 
        } 
    } 
} 
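
To make the fallback path concrete, the following is a minimal single-threaded harness, offered as a sketch under stated assumptions rather than as the disclosed implementation. It swaps in C11 <stdatomic.h> for atomic_t and cmpxchg(), and stubs the hardware primitives: begin_txn() always fails (as if every transaction exceeded the cache limits), and the result of end_txn_check() is controlled by the hypothetical variable txn_would_succeed.

```c
#include <assert.h>
#include <stdatomic.h>

#define TXN_LOCK_HELD   0x01
#define TXN_LOCK_DOLOCK 0x10

typedef atomic_int txn_lock;   /* C11 stand-in for atomic_t */

/* Stubbed hardware primitives (assumptions for this sketch only). */
static int txn_would_succeed;                 /* result of the pseudo-transaction */
static int  begin_txn(void)       { return 0; }   /* every transaction aborts */
static void begin_txn_check(void) { }
static int  end_txn_check(void)   { return txn_would_succeed; }
static void commit_txn(void)      { }
static void abort_txn(int mimic_hw) { (void)mimic_hw; }

/* Returns the value found in *tp; equals oldval when the exchange succeeded. */
static int cmpxchg(txn_lock *tp, int oldval, int newval)
{
    int expected = oldval;
    atomic_compare_exchange_strong(tp, &expected, newval);
    return expected;
}

static void spin_lock(txn_lock *tp)
{
    int oldval, newval;

    for (;;) {
        oldval = atomic_load(tp);
        if (oldval & TXN_LOCK_DOLOCK) {
            /* Software mode: wait for the holder, then try to take the lock. */
            while ((oldval = atomic_load(tp)) ==
                   (TXN_LOCK_DOLOCK | TXN_LOCK_HELD))
                continue;
            if (oldval != TXN_LOCK_DOLOCK)
                continue;
            begin_txn_check();
            newval = oldval | TXN_LOCK_HELD;
            if (cmpxchg(tp, oldval, newval) != oldval) {
                abort_txn(0);
                continue;
            }
            return;
        }
        if (!begin_txn()) {
            /* Transaction failed: switch the lock to software mode and retry. */
            oldval = atomic_load(tp);
            if ((oldval & TXN_LOCK_DOLOCK) == 0)
                (void)cmpxchg(tp, oldval, oldval | TXN_LOCK_DOLOCK);
            continue;
        }
        oldval = atomic_load(tp);
        if (oldval & TXN_LOCK_DOLOCK) {
            abort_txn(0);
            continue;
        }
        return;
    }
}

static void spin_unlock(txn_lock *tp)
{
    int oldval, newval, nextval;

    if ((oldval = atomic_load(tp)) != (TXN_LOCK_DOLOCK | TXN_LOCK_HELD)) {
        commit_txn();
        return;
    }
    /* Clear DOLOCK only if a real transaction would have succeeded. */
    newval = end_txn_check() ? 0 : TXN_LOCK_DOLOCK;
    while ((nextval = cmpxchg(tp, oldval, newval)) != oldval)
        oldval = nextval;
}
```

With these stubs, the first spin_lock() on a zeroed lock word sets TXN_LOCK_DOLOCK after the failed transaction and then acquires the lock in software. The matching spin_unlock() leaves the lock in software mode or returns it to transactional mode depending on txn_would_succeed.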



Additional sophistication: 



1. Digital filter, maintaining state within lock word. This would provide some hysteresis, which 
might be useful in some situations. 

2. Hints from compiler passed in as hidden argument to spin_unlock(). The compiler could compute 
a score based on the notion of "transfer functions" described by Wilson. This would count the 
expected number of memory references in the expected critical section, and produce some 
function of this number, with the number of references to distinct cache lines being one approach. 
Compilers that have full awareness of the hardware structures (cache size, associativity, and other 
limitations on the transactions) could produce a better estimate of the likelihood of transaction 
success. The spin_unlock() primitive would then be more aggressive in clearing 
TXN_LOCK_DOLOCK in cases where transactions were more likely to succeed. 

3. Keep per-lock-caller state, sort of like branch-prediction tables in CPUs. The idea here is that the 
same lock will often be used for multiple critical sections with wildly differing cache footprints. 
The spin_lock() primitive would record its address in the lock word when acquiring the lock in 
software, and the spin_unlock() primitive would measure the transaction-completion success rate 
on a per-spin_lock() basis. The spin_lock() primitive would then more aggressively use 
transactions on code paths where they had good records of success. 

4. It is also possible to track success rates when transactions are in use, but the implementation must 
take care, as the act of tracking the success rate will cause the transactions to be more likely to fail. 
The difficulty is that spin_lock() must record its identity so that the corresponding spin_unlock() 
can communicate the measurements. Keep in mind that there can be many-to-many relationships 
between spin_lock() and spin_unlock() primitives. One approach is for spin_lock() to record its 
identity in a machine register. This limits the accuracy of the measurements in cases where 
critical sections are deeply nested, and also increases the frequency of compiler register-spill 
operations, which in turn increases the cache footprint of the critical section, decreasing the 
probability of transaction success. A per-CPU (cache aligned and padded!) array in memory can 
increase the measurement accuracy, but this again increases the memory footprint. 

5. Count the number of times a given code path has failed, making spin_lock() more likely to use 
software locking in cases where there have been multiple failures. The example given above uses 
software locking after a single failure, but could be made more aggressive by forcing looping on 
the cmpxchg() primitive. 

6. A queued lock or NUMA-aware lock could be used to increase efficiency in high-contention 
situations. 

7. This could be adapted to reader-writer locks. 
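
As one possible reading of item 1 above, the digital filter could be a small integer moving average of end_txn_check() outcomes with separate upper and lower thresholds, so that a single success or failure does not flip the lock's mode. The sketch below is illustrative only. The struct, the thresholds, and the function name txn_filter_update() are hypothetical, and for simplicity the state lives in its own struct rather than being packed into the lock word as the item suggests.

```c
#include <assert.h>

/* Hypothetical thresholds on a 0..256 score (assumptions, not from the text). */
#define TXN_FILTER_HI 192  /* at or above: switch to transactional mode */
#define TXN_FILTER_LO  64  /* at or below: switch to software locking */

struct txn_filter {
    int score;    /* smoothed success estimate, 0..256 */
    int use_txn;  /* current decision: 1 = use transactions */
};

/*
 * Fold one pseudo-transaction outcome into the filter.  The score moves a
 * quarter of the way toward 256 (success) or 0 (failure); the decision
 * changes only when the score crosses a threshold, giving hysteresis.
 */
static void txn_filter_update(struct txn_filter *f, int success)
{
    int target = success ? 256 : 0;

    f->score += (target - f->score) / 4;
    if (f->score >= TXN_FILTER_HI)
        f->use_txn = 1;
    else if (f->score <= TXN_FILTER_LO)
        f->use_txn = 0;
}
```

With these particular constants, a filter starting at zero needs five consecutive successes before it selects transactions, and then tolerates three consecutive failures before reverting on the fourth. In a fuller version, spin_unlock() would consult use_txn instead of the raw end_txn_check() result when deciding whether to clear TXN_LOCK_DOLOCK.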



